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Abstract —Given two sets of vectors, A = { ai ,..., cC m } 
and B = { hi,..., b n }, our problem is to find the top- 
t dot products, i.e., the largest | ai • bj | among all possible 
pairs. This is a fundamental mathematical problem that appears 
in numerous data applications involving similarity search, link 
prediction, and collaborative filtering. We propose a sampling- 
based approach that avoids direct computation of all mn 
dot products. We select diamonds (i.e., four-cycles) from the 
weighted tripartite representation of A and B. The probability of 
selecting a diamond corresponding to pair (i,j) is proportional 
to ( di ■ bj) 2 , amplifying the focus on the largest-magnitude 
entries. Experimental results indicate that diamond sampling is 
orders of magnitude faster than direct computation and requires 
far fewer samples than any competing approach. We also apply 
diamond sampling to the special case of maximum inner product 
search, and get significantly better results than the state-of-the- 
art hashing methods. 

I. Introduction 

Finding similar items is a fundamental problem that under¬ 
lies numerous problems in data analysis. Link prediction in a 
graph can be cast as finding similar nodes in the graph [1], [2]; 
customers are recommended similar products [3]; text analysis 
often involves finding similar texts [4], [5]; data cleaning 
requires removal of entries that are essentially identical [6]. 
In these settings, entities are represented as vectors in high¬ 
dimensional feature space, i.e., v £ for some large d. 
Many notions of similarity involve dot products, so a measure 
of distance between v and w is v ■ w. This subsumes cosine 
similarity, common-neighbors, database join operations [7], 
frequent itemset mining, data cleaning [6], etc. Motivated 
by these applications, we study the Maximum All-pairs Dot- 
product (MAD) problem. 

Defn. 1 ((-MAD: Max All-pairs Dot-product). Given 
two sets of d-dimensional vectors A = {di,...,am} and 
B = b n }: find the index pair ( i,j ) that maximizes 

| a) • bj |. More generally, given additional parameter t, find 
the t index pairs {(*i, Ji), • • ■, {it, jt)} corresponding to the 
t largest dot products. 

It is convenient to think of A and B as matrices (A £ K dxm 
and B £ R dx "), where the columns are the corresponding 
vectors. The MAD problem is exactly finding the largest 
entries in the product C = A T B £ R mx ”. 



Fig. 1: Million song dataset [10], [11], [12]: Dataset with 350K 
songs, and 48M user-song entries. Diamond sampling finds top 100 
correlated songs within 7 seconds, while exhaustive search takes 
around 80 seconds. 


The MAD formulation subsumes many existing problems in 
the literature. For the special case where A is a single column 
(equivalently, m = 1), this is the exactly the MIPS (Maximum 
Inner Product Search) problem [8], [9]. Here, we maximize 
the dot product with a among all columns of B. When A = 
B is the adjacency matrix of a graph, this is equivalent to 
finding pairs of nodes with the most common neighbors, a 
fundamental link prediction operation. 

A. Difficulties with previous art 

The most obvious approach is to simply compute A T B ex¬ 
haustively. There is a rich history on algorithms for dense and 
sparse matrix multiplication [13], [14], with implementations 
in libraries like Intel MKL’s BLAS [15] and CSparse [16]. 
However, existing algorithms become prohibitive as the sizes 
of A and B grow, even if they are sparse. 

There is also much literature in approximate matrix mul¬ 
tiplication, which quickly computes an approximate product 
C ~ A T B. Drineas, Kannan, and Mahoney [17], [18] intro¬ 
duced an approach based on sampling rows of A and B to 
minimize Frobenius norm of the error. Much study has been 














done on various sampling strategies [19], [20], [21], [22], [23], 
[24]. These methods are also not suited for MAD, since we 
only care for a few entries (at most, say, 1000) despite the 
matrix having dimensions in the millions. 

An alternate, popular approach for high-dimensional nearest 
neighbor search is some form of dimension reduction, the 
most famous approach being Indyk and Motwani’s Locality 
Sensitive Hashing (LSH) [25], [26], [27]. Recent results by 
Shrivastava and Li extend LSH methods for the MIPS prob¬ 
lem [9]. These approaches usually involve randomly hashing 
the vectors into a few “buckets”, and then searching for vector 
pairs that share many common buckets. For high-dimensional 
data, the maximum dot product is often small in comparison 
to the vector norms, so the similarities are quite small. This 
means that many hashes are required to find the nearest 
neighbors, leading to a storage blowup (typically, two orders 
of magnitude more than the data). This can be quite prohibitive 
even for small data sets. 

B. The sampling approach 

We take a different route, and apply index sampling meth¬ 
ods. The idea is to sample pair (i, j) proportional to some 
function of the dot product a] ■ bj. With enough samples, we 
hope to find the large entries of C = A T B. The earliest ap¬ 
plication of this idea is by Cohen and Lewis, who constructed 
a sampling algorithm for the MIPS problem [28], [8], Their 
approach samples pairs (i, j) proportional to a] • bj [28], [8]. 
Campagna and Pagh give sampling approaches for a variety 
of distance measures [29]. 

We stress that no previous result (either sampling based, 
LSH based, or otherwise) applies directly to the MAD prob¬ 
lem. 

C. Our Contributions 

Diamond sampling: Our main contribution is diamond 
sampling, a new randomized approach to the MAD problem. 
This is inspired by recent work by Jha et al for 4-vertex motif 
detection in large graphs [30], Their idea is to sample 3-paths 
in graphs to estimate counts of 4-vertex motifs. We generalize 
that idea to the matrix product setting, to design a sampling 
procedure for the MAD problem. Diamond sampling is able to 
sample pairs (i,j) proportional to (a) • bj) 2 . The square term 
is a critical improvement over Cohen and Lewis; it allows for 
faster convergence to the top entries of C = A T B. 

Theoretical analysis: We give a theoretical analysis of 
diamond sampling, and prove concentration bounds on its 
behavior. Our analysis shows the eventual convergence of the 
sampling to squared entries, with no assumption whatsoever. 
Previous sampling work required nonnegativity assumption, or 
assumed structural correlations among positive entries [28], 
[8]. We give strong storage bounds on diamond sampling, and 
show that it requires very little overhead. 

Empirical validation: We apply diamond sampling on 
six real-world datasets and show that is extremely efficient. 
Diamond sampling is orders of magnitude faster than exact 
computation and requires far fewer samples compared to 


other matrix sampling approaches. Fig. 1 shows the results 
of diamond sampling on a song dataset with 48M user-song 
entries. We consider A = B to be the matrix where songs are 
columns and attempt to find the top correlated songs. We can 
get the top 100 pairs in an order of magnitude less time than 
exhaustive computation. 

We consider numerous applications in product recommenda¬ 
tion and link prediction, and consistently get to top 10-100 dot 
products, with a speedup of 10-100X over exact computation. 
Furthermore, the number of samples required is much smaller 
than the Cohen-Lewis approach [8]. 

Application to MIPS: Given recent interest in the MIPS 
problem, we also apply diamond sampling to a MIPS problem, 
as used in [9]. We focus on the MovieLens dataset [31], a 
collaborative filtering application. We get significantly better 
precision-recall curves with a maximum precision of 90-100% 
as opposed to 30-65% with asymmetric LSH methods. Our 
running time is a fraction of a second per query. Our diamond 
sampling requires minimal storage overhead and much less 
than LSH methods, which require large amounts of memory, 
easily running into hundred times the dataset size (requiring 
a large-memory machine). 

II. Preliminaries and notation 

We use the standard notation that [n] = { 1,..., n }. For 
x £ R, the function sgn(.T) = 1 if x > 0 and — 1 otherwise; 
i.e., x = sgn(a:)|a:|. Let v £ M" denote a vector and M = 
irriij) £ R mxn be a matrix. We denote the vector and matrix 
p-norms as follows: 

Tb 777 , 77 , 

11*11? = X>I P and M? = EEKP- (D 

i =1 i—1 j =1 

Note that the matrix p-norms are entrywise rather than the 
induced norms. We let nnz(M) denote the number of nonzeros 
in M\ if M is dense, then we suppose without loss of 
generality that nnz(M) = mn. 

We assume throughout that A £ R dxm and B £ R dxn . We 
use k, k! £ \d) to index rows of A and B. The /cth rows of A 
and B (transposed to column vectors) are denoted by a kif and 
bk * respectively. We use i, i' £ [to] to index columns of A, 
whose ith column is denoted by a] or a*j. We use j, j' £ [n] 
to index columns of B, whose jth column is denoted by bj 
or b*j. Since, by definition, C = A r B, we have 

Cij = dj ■ bj = y, a ki b k j for all i £ [m], j £ [n], 
k 

If A and B are binary, i.e., unweighted adjacency matrices, 
then we can consider them as representing a tripartite graph 
on to + d + n nodes; see Fig. 2. From this interpretation, we 
define the neighbor sets, 

A rf- = { k £ [d\ I a ki = 1} , M k = { i £ [to] | a ki = 1 } , 
A ff = {k£ [d] I b kj = 1}, U k = { j £ [n] I b kj = 1} . 


Correspondingly, we can define degrees of the nodes, i.e., 

degf = \K A \ = ||o*i||i,deg^ = \K A \ = ||ofc.||i, 
degf = | = ll^Hi.degf = |7V^| = ||fo fe *||i- 

Random selection is uniform, i.e., equal probability for all 
elements of a discrete set, unless stated otherwise. 

III. Diamond sampling 

Complexity of diamond sampling depends on the two input 
matrices. We start our discussion with the special case of 
binary matrices A and B. We follow with the general case 
and then discuss other special cases such as nonnegative inputs 
and computing the maximum in A T A , i.e., B = A. 

A. Binary inputs 

To motivate our procedure, we start with the case where 
A and B are binary matrices. We can represent this as a 
tripartite graph where the m columns of A, indexed by i, 
correspond to nodes on the left; the n columns of B, indexed 
by j, correspond to nodes on the right; and the d common 
rows of A and B, indexed by k or k', correspond to nodes 
in the center. Edge (*, k) exists iff a k i = 1; likewise, edge 
(k, j) exists iff bkj = 1. Therefore, Cjj is simply the number 
of common neighbors of i and j: 

Cij = at • bj = \{k | k G N A ("I ftff } |. 


(a) Wedge ( i,k,j ) (b) Three-path (c) Diamond 

(k',i,k,j) ( k',i,k,j) 

Fig. 2: Illustration of tripartite graph. For simplicity, we show only 
those edges incident nodes i or j. 

If node k has an ^-neighbor i and a B -neighbor j, then 
we call ( i,k,j ) a “wedge.” The existence of such a wedge 
implies that c, :) > 1. In fact, there are exactly c, 3 distinct 
wedges connecting pair see Fig. 2a. The probability of 

selecting a random wedge with endpoints (i. j ) can be shown 
to be proportional to Cjj [28], [8]. 

In diamond sampling, our goal is find a “diamond” 
( k',i,k,j ) formed by two intersecting wedges, i.e., ( i,k,j ) 
and (j, k!,j); see Fig. 2c. Note that any pair (i,j) participates 
in cfj diamonds (note that we are not requiring k and k! to 
be different). Hence, the probability of selecting a random 
diamond of the form ( k',i,k,j ) is proportional to cfj. 

Sampling random diamonds will expedite identifying the 
largest dot products as compared to sampling random wedges; 
however, sampling random diamonds is more complex. Thank¬ 
fully, we can adapt the arguments of Jha et al. [30] for this 


purpose. Here, the goal is to find a random three-path of 
the form ( k',i,k,j ). If it closes to form a four-cycle, then 
it is a random diamond. Moreover, these samples will be 
uncorrelated. That is, given a set of random 3-paths, those 
that complete to a diamond will form a uniform sample of the 
diamonds. See Fig. 2c for a three-path that closes to form a 
diamond and Fig. 2b for one that does not. 

Finding a random three-path of the form (k',i,k,j) is a 
multi-step procedure, shown in Algorithm 1 and illustrated in 
Fig. 3. In Line 2, we weight each edge (j, k) according to 
the number of three paths it is the center of, i.e., degf deg]® 
(again we do not require k ^ k'), and store the weights in 
a matrix W. Observe that W has the same sparsity pattern 
as A. In Line 6, we select a random edge (i, k ) proportional 
to its weight (see Fig. 3a). To complete the three-path, we 
select a random neighbor of k in B, labeled j in Line 7 (see 
Fig. 3b) and a random neighbor of i in A, labeled k! in Line 8 
(see Fig. 3c). This yields a uniform random three-path. If edge 
(k',j) exists, i.e., b k ’j = 1, then the three-path is a diamond 
and so we increment the counter Xij in Line 9; obviously, 
nnz(A) < s. 


Algorithm 1 Diamond sampling with binary inputs 

Given matrices A £ { 0,1 } mxd and B £ { 0,1 }' ixd . 
Let s be the number of samples, 
l: for (k,i) e [d] <g> [m] do 
2 : w ki «- a ki degf degf 

3: end for 

4: X <— all-zero matrix of size m x n 
5: for £ = 1,..., s do 

6: Sample (k,i) with probability tUfc»/||W||i 

7: Sample j from M k 

8: Sample k' from J\f A 

9: Xij ^ Xij *F b k 'j 

10: end for 

11: Postprocessing (see Algorithm 2) 



Fig. 3: Illustration of diamond sampling in Algorithm 1. For sim¬ 
plicity, we show only those edges incident nodes i or j. 

The largest values in A' correspond to the (likely) largest 
dot products, but we do some further postprocessing to obtain 
the final answer, as shown in Algorithm 2. We are seeking the 
top -t dot products. We have a budget of t' > / dot products, 








where we assume t! <C ran. We let fl s denote the indices of all 
the nonzeros in X and f \i denote the top-/' entries in X\ this 
requires a sort in Line 1 of at most s items (and generally many 
fewer, depending on the proportion of three-paths that close 
into diamonds). We compute the t' dot products in Lines 3 
to 5 at a cost of 0{t'd). Finally, we let il t denote the top -t 
dot products from fV in Line 6, requiring a sort of /;' items. 


Algorithm 2 Postprocessing 

Given fl s = { (i. j) \ x i3 >0}. Let t be the number of top 
dot products, and t' > t be the budget of dot products. 

1: Extract top-// entries of X, i.e., \V. t '\<f and 

n t / £- { (i,j) £ fl s I Xij > ,j') e!! s \ Cl t ' } 

2: C i — all-zero matrix of size m x n 
3: for (i,j) £ f l t i do 
4: Cij ■£- afbj 

5: end for 

6: Extract top-f entries of C, i.e., |fl t | < t and 

flf i { (i,j) £ | Cij ^ CjjjX(i , j ) £ £lt' \ } 


B. General inputs 

We present the binary version as general motivation, but 
our implementation and analysis are based on the diamond 
sampling algorithm for general real-valued A and B in Al¬ 
gorithm 3. In this case, we define the matrix of weights 
W £ R dxn such that 

Wki = \au\ ||a*i||i||&fc*||i for all k £ [d], i £ [to]. 

The weight Wki correspond to the weight of all three paths 
with edge (i. k) at its center. This is computed in Line 2. The 
sampling in Line 6 has the same complexity as in the binary 
case, but the sampling in Lines 7 and 8 now has a nonuniform 
distribution and so has higher complexity than in the binary 
case. The postprocessing is unchanged. 


Algorithm 3 Diamond sampling with general inputs 
Given matrices A £ R mxd and B £ R" xd . 

Let s be the number of samples. 

1: for all ciki 7/ 0 do 

2- VJki i jufcil || || 11| bk* || 1 

3: end for 

4: X <r- all-zero matrix of size m x n 
5: for £ = 1,..., s do 

6: Sample (k,i) with probability tUfci/HILHi 

7: Sample j with probability |&fcj|/||6fc*||i 

8: Sample k! with probability |afc'i|/||a*i||i 

9: Xij i Xij T Sgn (dkibkj^h'i) bk'j 

10: end for 

11: Postprocessing (see Algorithm 2) 


1) Nonnegative inputs: If A and B are nonnegative, the 

only change is that the sign computations can be ignored in 
computing the sample increment in Line 9 in Algorithm 3. 
This avoids potentially expensive random memory accesses. 


2) Equal inputs (Gram matrix): If B = A, then C = A T A 
is symmetric. The matrix X is not symmetric, although E[X] 
is. Hence, we modify X before by inserting the following step 
before the postprocessing in Line 11 in Algorithm 3: 

X <- (X + X t )/2. (2) 

Now X is symmetric, and the forthcoming analysis is unaf¬ 
fected. 

3) Equal symmetric inputs (squared matrix): If B = A and 
A is symmetric, then C = A 1 2 and we can replace Line 9 in 
Algorithm 3 with the following two lines: 

Xij i Xij T Sgll( Clk'i ) bk'j/2 , 

Xkk' ^ Xkk' + sgn(afcj&fcjOfc'j) bk’j/2. 

This exploits the fact that we can swap the role of k and i in 
the initial edge sample. Again, X may not be symmetric, so 
we insert (2) before the postprocessing in Line 11. 

C. Complexity and space 

Let a = nnz(A) and [3 = nuzf B). In the dense case, a = 
md and f3 = nd. The total work is 

0(a + (3 + s log(sa/3)). 

The total storage (not counting the inputs A and B) is 

2 storage(A) + storage(f?) + 5s + 3 1’ + 3 t. 

We give detailed arguments below and in the implementation 
discussion in Section V. 

Preprocessing. For the sampling in Lines 7 and 8, we 
precompute cumulative, normalized column sums for B and 
the same for rows of A, requiring storage of storage ( A) + 
storage(i3) and computation of 0(a + (3). The matrix W has 
the same nonzero pattern as A, so the cost to store it is equal 
to storage(A) and to compute it is O(a). 

Sampling. For a straightforward implementation, the cost 
per sample in Line 6 is 0(log(a)). For Line 7, the cost per 
sample is 0(log(/3/d)); here, we have used the approximation 
nnz (bk*) ~ /3/d. A similar analysis applied for A and Line 8. 
So, the cost per sample is 0(log(a) + log(/3/d) + log(a/m)). 
Without loss of generality, we assume that we need to store 
the three-paths and the summand in Line 9 for a total storage 
of 5s. 

Postprocessing. Conservatively, we require 3 t' storage for 
the ( i,j,Xij or c^) triples in f l t ' and 3t storage for the 
( i,j, c^) triples in Cl t . The sorting requires at most O(slogs) 
time, and usually much less since nnz(X) may be much less 
than s due to only some three-paths forming diamonds and 
concentration, i.e., picking the same (i, j) pair multiple times. 

IV. Analysis of Diamond Sampling 

This section provides a theoretical analysis of diamond 
sampling. We first prove that the expected value of x^ is 
c?./||W||i, and then we prove error bounds on our estimate as 
a function of the number of samples. Unless stated otherwise, 
our analysis applies to the general version of the diamond¬ 
sampling algorithm (Algorithm 3). 










A. Expectation 

For a single instance of Lines 6 to 8 of Algorithm 3, we 
define the event 

Sk'ikj = choosing three-path ( k',i,k,j ). 

Lemma 1. Pr(£ fc < ijy ) = \a ki b kj a k >i\/\\W\h. 

Proof: The probability of choosing three-path ( ki, k,j) 
is (by independence of these choices) the product of the 
following probabilities: that of choosing the center edge (i, k), 
then picking j, and then picking k!. 

Pr (£k’ijk) = Pr(ctr (i,k)) ■ Pr(endpts j and fc'Ictr ( i,k )) 
Whi \bkj\ 

~ Wh ' ' ¥^h 

_ || 

IIW’Hi ||&fc*||i ||«*i||i 

\®kibkjQ'k , i\ 

\\w\h ‘ 

■ 

In what follows, we use Xij e to be the following random 
variable: if i.j are the respective indices updated in the /th 
iteration, X lJ: f = sgn {a k ib k ia k >i)b k 'j. Otherwise, X l3 y = 0. 
Observe that x i3 = 

LEMMA 2. For diamond sampling, E [xij/s\ = c^/||IL||-|. 

Proof: We note that E[a Hj/s] = E [J2 e Xij t e]/s = 
Ef-X^i], (We use linearity of expectation and the fact that 
the Xi jj are i.i.d. for fixed i,j and varying L) 

l] = Pi '(£k'ikj) ■ sgn (a k ib ki a k 'j) bk’j 

k k' 

V'' \ a kibkjO'k , i\ , , \ , 

= 2^ \\ w \\ ’ sgn{a ki b kj a kli ) b klj 

k k' 11 11 

= ||W||i ^ ^ aki ^ kj 

= Wh akibkj ) ak ' ibk ' j ) 

“PFiT (?““'’“) = P%' 


B. Concentration bounds 

We now provide some concentration bounds when all entries 
in A and B are nonnegative. 

LEMMA 3. Fix e > 0 and error probability 8 £ (0,1). Assume 
all entries in A and B are nonnegative and at most K. If the 
number of samples 

s>3K\\W\\ 1 \og(2/8)/(6 2 c 2 ij ), 

then 

p r[|zij||W||i/s - 4| > ec 2 j\\ < 8. 


Proof: Observe that X l3 j is in the range [0, K\. Thus, 
Yi,j,e = Xij t e/K is in [0,1]. Set y i3 = Since y l3 

is the sum of random variables in [0,1], we can apply the 
standard multiplicative Chernoff bound (Theorem 1.1 of [32]). 
This yields Pr [y^ > (1 + e)E [y^]} < exp(-£ 2 E[y„-]/3). 
By Lemma 2, E [y i3 \ = (s / K)(c 2 j/\\W\\f), which is at least 
3 log(2/<5)/£ 2 by choice of s. Hence, Pr [yij > (l+e)E[j/jj]] < 
8/2. Note that y i3 = x l3 / K. We multiply the expression 
inside the Pr[-] by A"||FL||i/s to get the event £jj||lP||i/s > 

(l + e)4- 

Using the Chernoff lower tail bound and identical reasoning, 
we get Pr[a;jj||VU||i/s < (1 — e)c 2 f < <5/2. A union bound 
completes the proof. ■ 

The following theorem gives a bound on the number of 
samples required to distinguish “large” dot products from 
“small” ones. The constant 4 that appears is mostly out of 
convenience; it can be replaced with anything > 1 with 
appropriate modifications to s. 

THEOREM 4. Fix some threshold t and error probability 8 £ 
(0,1). Assume all entries in A and B are nonnegative and at 
most K. Suppose s > 12K ||VU||i \og(2mn / 8) / t 2 . Then with 
probability at least 1 — 8, the following holds for all indices 
i,j and i',j': if Cij > r and Cey < r/4, then Xij > Xi>ji. 

Proof: First consider some dot product c, ; with value 
at least r. We can apply Lemma 3 with s = 1/2 and error 
probability S/mn, so with probability at least 1 — 8/mn , 
xuWWWi/a > 4/2 > t 2 /2. Now consider dot product 
Cj iji < t/ 3. Define y^y and Ye y e as in the proof of 
Lemma 3. We can apply the lower tail bound of Theorem 
1.1 (third part) of [32]: for any b > 2eE [yey], Pr \yey > b] < 
2 \ 

We set b = st 2 /2K ||W||i. From Lemma 2 and the 
assumption that Cj/ji < t/3 and E [yey] = E[xj 'y]/K = 
sc 2 ,y/K\\W\\i < st 2 /(ld)K\\W\\i) < b/2e. Plugging in out¬ 
bound for s, b> (12JT||FU||i \og{2mn/8) /t 2 )-t 2 /(2iL|| W||i) 
= Q\og(/2mn / 8). Hence, Pr \yey > &] < 8/{2mn). Equiva¬ 
lently, Pv[xi'j'\\W\\i/s> t 2 /2] <8/{2mn). We take a union 
bound over all the error probabilities (there are at most mn 
pairs i,j or iff). 

In conclusion, with probability at least 1 — <5, for any pair of 
indices i, j: if > t, then £ij|[PP||i/s > r 2 / 2. If c^- < r/4, 
then iEjj|[kU[|i/s < r 2 / 2. This completes the proof. ■ 

To get a useful interpretation of Lemma 3 and Theorem 4, 
we ignore the parameters e and 8. Let us also assume that 
K = 1, which is a reasonable assumption for most of our 
experiments. Basically, to get a reasonable estimate of Cy, 
we require ||FU||i/ 4 samples. If the value of the f-th largest 
entry in C is r, we require ||FF||i/r 2 samples to find the t- 
largest entries. For instance, on a graph, if we want to identify 
pairs of vertices with at least 200 common neighbors, we can 
set r = 200, and ||kU||i will be the number of (non-induced) 
3-paths in the graph. The square in the denominator is what 
makes this approach work. In Table I of Section VI, we show 
some of the values of ||FU||i/r 2 for particular datasets, where 
r is the magnitude of the largest entry. 














V. Implementation Details 
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Sampling Time Breakdown: MillionSong, 1.00e+07 samples 


We discuss the implementation details for reproducibility, 
but we stress that the implementation is not our primary 
contribution. Nevertheless, careful thought has gone into the 
process and we show that a clever implementation of the 
sampling can improve performance by almost 3x; see Fig. 4. 

A. Sampling from discrete distributions 

We consider two alternative schemes for drawing s samples 
from an arbitrary discrete distribution defined by the vector 
p £ [0, l] p such that YAk=i Pfc = 1- The choice of schemes is 
based on the relative sizes of s and p. 

If the size of the distribution, p, is smaller than the number 
of samples, s, then we use binary search on the cumulative 
sums of p to determine each sample. This requires O(slogp) 
comparisons, plus 0(p) work for the preprocessing to compute 
the cumulative sum. We note that using the alias method 
[33] yields a constant time per search at the same cost for 
preprocessing and storage (up to a constant). The cumulative 
sum requires 0{p) space, and sampled events are stored as 
counts in space O(p). Note that both the binary search and 
sample counter increments involve random (not contiguous) 
memory access. This returns a count vector c of length p such 
that Cfc is the number of occurrences of event k and ° k = s. 

If, on the other hand, the number of samples, s, is less than 
the size of the distribution, p, we can avoid the binary search 
by doing a single sort of the samples, as shown in Algorithm 4. 
This is essentially a variation on merge sort, but we sort only 
one of the two lists and compute the other on the fly. The 
preprocessing involves sorting s random numbers, requiring 
O(slogs) comparisons and O(s) space. Sampled events are 
determined by walking through the sorted samples and the 
probability distribution, computing the cumulative sums along 
the way, requiring 0(n + s ) computations. Sampled events are 
stored explicitly in space O(s). Note that all memory accesses 
in this approach are contiguous (reads and writes). This returns 
an explicit sample list e of length s such that e-| < ■ ■ ■ < e/ : . 


Algorithm 4 Search via sample sorting (for s < p) 

Given probability distribution p S [0, l] p and s = # samples 
1: Tp 1/(0,1) for all i S [s] 

2: Sort the vector r so that ri < • • • < rt 

3: k 4— 1, p 4— Pfc 

4: for i = 1,..., s do 

5: while m > p do 

6: k <— fe + 1, p «— p + pk 

7: end while 

8: et k 

9: end for 


Thus, the cost of sampling and storing s events from a 
discrete distribution of size p is 0(s log(min{s,p}) + p) 
computations and 0(min{s,p}) space. 

B. Diamond sampling with locality optimizations 

The implementation of Algorithm 3 takes advantage of the 
specialized sorting mentioned above as well as locality, as we 
explain. 


colnorms/cumsums 
center weights 
I I random number sort 

I I center sample 

20 - I I center transposes 

I I left sample 

right sample 

IK*— compress 4th edges 
set output entry 

15 . | — compress/sort output 



Fig. 4: Time breakdown for optimized and straightforward imple¬ 
mentations of diamond sampling (Algorithm 3). 


In the preprocessing, in anticipation of the sampling in 
Line 7, we compute a matrix B such that each row is a 
normalized cumulative sum, i.e., b^j = Y2k'<k^k'j/\\bk*\\i- 
We store the matrix B in compressed sparse row (CSR) format, 
so that the entries of bk* are contiguous in memory. Similarily, 
for Line 8, we compute a matrix A such that each column is 
a normalized cumulative sum, i.e., aki = Y^p<i a fci , /ll a «lli- 
We store the matrix A in compressed sparse column (CSC) 
format, so that the entries of are contiguous in memory. 
Note that storage A in CSC format is equivalent to storage 
A T in CSR format, so we need only one data structure. 

We separate lines Lines 6 to 9 into four separate loops, first 
computing s pairs of the form (fc,t), then s /j-neighbors, etc. 

For the samples in Line 6, we use the search via sample 
sorting in Algorithm 4 for choosing the samples from W 
since typically nnz {W) = nnzf/l) s. Because of the way 
that Algorithm 4 works, the pairs {kc,it) for i £ [s] are 
conveniently sorted according to he. 

The sorted values yield data locality for Line 7, where we 
use standard binary search to choose the values jt for i £ [s]. 

We rearrange the s samples in O(s) time so that they are 
ordered according to i(. Then we use standard binary search 
to choose the values k' e for l £ [s] from Line 8. 

Finally, we reorder the samples in O(s) time so that they 
are sorted according to kf t , enabling efficient lookups for bk'j 
values in Line 9. 

Fig. 4 shows a 2.7 times speed-up for our optimized 
implementation of Algorithm 3 versus a straightforward imple¬ 
mentation. In particular, we note the drastic reduction in time 
for center, left, and right samples and setting the output entry 
(which involves searching for the existence of the 4th edge) in 
the optimized implementation. This is due to achieving better 
data locality (i.e., cache performance), and the overheads of 
the reorderings to attain this locality are amortized. 

C. Exact computation (for comparison) 

MAD corresponds to find the highest entries in a matrix- 
matrix product. General high-performance implementations 

























are available. In the dense case, the BLAS interface allows 
access to vendor-tuned libraries like Intel’s Math Kernel 
Library [15] or NVIDIA’s cuBLAS [34]. In the sparse case, 
matrix multiplication is available in CSparse [16], an efficient 
open-source library that is used by MATLAB for many sparse 
computations. The computational cost of matrix multiplication 
is 0(mnd) in the dense case (assuming the classical algorithm 
is used) and 0(^ fc deg[? -deg^), i.e., the number of wedges 
in the tripartite graph, in the sparse case. The storage cost of 
library implementations of matrix multiplication, mn in the 
dense case and up to mn in the sparse case, is generally the 
limiting factor. 

To adapt these high-performance libraries, we perform a 
series of matrix-vector products, cj = A T bj for j £ [n], to 
compute the columns of C one at a time. We do not save 
the columns but instead use a priority queue to track the top- 
/ entries. Because CSparse is open source, we were able to 
modify the code to minimize the memory footprint, achieving 
0(storage(A)+storage(f3)+f), with little loss in performance. 
In the dense case, we compute C in column blocks to size n x 
d , computing and processing the output matrix in chunks using 
the dgemm interface to MKL, so that the memory footprint is 
0(storage(A) + storage(B) + /). 

VI. Experiments 

All experiments are run on an Intel Xeon E5-2650 “Ivy 
Bridge” 2.0 GHz machine with 32 GB of memory. Our 
codes are written in C/C++ with a mex-interface to MATLAB 
(Version 8.3.0.532). The codes are all single-threaded. 

A. Datasets 

We experiment on real-world datasets described below. 

• as-Skitter A T A: Skitter is an internet topology graph 
from the “as-Skitter” dataset from SNAP [35], This yields 
a sparse binary symmetric n x n matrix A with n = 
1,696,415 nodes and 11,095,298 nonzeros. 

• Movielens A T B : The Movielens-IOM data set [31] com¬ 

prises a sparse movie-user matrix, R, of size m x n with 
to = 65,133 movies and n = 71,567 users. Following Shri- 
vastava and Li [9], who in turn followed [3], we compute 
the low-rank SVD of R using d = 150 components, so 
that R « A t B where A £ and B £ M dxn are 

dense real-valued matrices. 

• Live Journal A 7 A: LiveJournal is a free online com¬ 
munity, and we use the “soc-LiveJournall” dataset from 
SNAP [35]. The corresponding “friendship” sparse sym¬ 
metric binary adjacency matrix A has dimension n = 
4,847,571 and 68,993,773 nonzeros. 

• ASIC A T A: The ASIC dataset is a Xyce circuit simu¬ 
lation matrix; we use the “Sandia/ASIC_680k” matrix 
from the Florida matrix collection [36]. The n x n real¬ 
valued matrix A is nonsymmetric (though it is structurally 
symmetric) with dimension n = 682,862 and 2,638,997 
nonzeros. 

• Amazon Kindle A T B: The Amazon product data consists 
of review and product data from Amazon [37], [38]. The 


Kindle category data includes m = 1,406,916 reviewers 
and n = 430,532 books. The reviewer-book rating matrix 
A contains numeric scores from 1-5 for reviewed books 
and has a total of 3,205,546 entries. The book-book 
similarity matrix B contain numeric scores of 1-4 to 
indicate the relationship (i.e., 4 indicates the book have 
been purchased together by someone) with a total of 
11,012,558 entries. 

• Million Song A T A: The Echo Nest Taste Profile Subset 
of the Million Song Dataset contains 48M user-song play 
counts from real users [10], [11], [12]. The resulting user- 
song matrix A has 1,019,318 users, 384,546 songs, and 
48,373,586 user-song play counts. 

B. Time and Accuracy Performance 

We present time and accuracy results for six data sets in 
Figs. 5 and 6. In this study, we vary s as an axis, set t' = s, and 
plot results for / £ { 1,10,100,1000 }. The top row plots the 
recall, i.e., the percentage of the top-/ entries identified, versus 
the number of samples, s. (Not all samples close to form 
diamonds; see Table I.) The bottom row plots the wall-clock 
computation time versus the number of samples, including 
the time for exact computation as described in Section V-C. 
Because the budget /' does not depend on / (we set /' = s), the 
timing is the same for all runs. Table I contains some additional 
data about the sampling, including the size of the largest 
entry, the size of ||kP||i, the ratio ||Wj|i/maxcjj (which is 
proportional to the number of samples needed to recover the 
largest entry according to Theorem 4), and the closure rate of 
the three paths to form diamonds. 


Dataset 

max | Cij | 

II W||i 

est. samples 

closure 

as-Skitter 

3.0e5 

2.9el2 

3e3 

19% 

Live Journal 

3.0e3 

1.5el2 

1.6e5 

13% 

Movielens 

l.leO 

2.8el0 

2.3e8 

100% 

ASIC 

2el2 

9.6el9 

2.4e-5 

100% 

Amazon Kindle 

3.2e3 

1.2el2 

1.2e5 

1.4% 

Million Song 

6.1e6 

1.4el5 

3.8el 

15% 


TABLE I: Summary statistics for datasets. The column labeled “est. 
samples” reports the ratio ||VK||i/maxc;j, the estimated number 
of samples required to find the top entry. The column labeled 
“closure” is the percentage of sampled three paths that correspond to 
a successful diamond sample. 

For as-Skitter {A T A = A 2 because input matrix A is 
symmetric; see Fig. 5), only 10 5 samples are required to 
capture all top 1000 entries in the output matrix, which 
requires 1.25 seconds (dominated by the preprocessing step). 
Exact computation, on the other hand, requires 160 seconds, 
which is 128 times slower. The top entry of this matrix has 
value 30,620, while the number of three paths ||kL||i is 2.9el2. 
The analysis in Section IV suggests that approximately 3000 
samples are required to find the top entry (see Table I), and we 
identified the top entry after 1000 samples in this experiment. 
Likewise, the 1000th top entry in A 2 has value 4,239, the 
analysis suggests 16K samples, and only 10K were needed 
for this experiment. We find all top-1000 entries with only 10 6 
samples, and only 19% of those samples turn into diamonds. 
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Fig. 5: Time and accuracy results for sampling approaches for datasets as-Skitter, Movielens, and Live Journal. The first row of plots 
presents the top-f scores over various numbers of samples, and the bottom row of plots shows the time in logarithmic scale. 






Fig. 6: Time and accuracy results for sampling approaches for datasets ASIC, Amazon Kindle, and Million Song. The first row of plots 
presents the recall scores over various numbers of samples, and the bottom row of plots shows the time in logarithmic scale. 


MovieLens (dense A T B; see Fig. 5) is the most difficult 
dataset because there is not much differentiation between 
the largest entries and smaller ones: the largest entry has 
magnitude 11.02 while the 1000th largest entry has magnitude 
7.37. Also, the estimated number of samples just to get the top 
entry is 10 8 ; see Table I. Here, 10 7 samples is not sufficient 
and anything more requires more time than exact computation. 
The diamond closure rate is 100% because the matrices are 
dense. This is an example of a relatively small dataset where 
sampling is not effective; nevertheless, we still have impressive 
precision-recall results in Section VI-E. 

In LiveJournal (sparse A T A\ see Fig. 5, we have ||W||i = 
1.5 x 10 12 and the top entry is 2997, so we estimate needing 
||H7||/c 2 - « 10 5 samples to find the largest entry (see Table I), 


which is exactly when we find it. We find all top-1000 entries 
with 10' samples and 10X less time than exact computation. 

The ASIC graph (sparse A T A; see Fig. 6) comes from 
scientific computing. Here the largest few entries are very large 
compared to all others. For instance, the predicted number of 
samples is < 1 in Table I. So, we can identify the top-10 but 
struggle to identify the much smaller entries in the remainder 
of the top-1000 (10 million samples identifies only 703 distinct 
output entries). For the top-10, however, we have three orders 
of magnitude speed-up compared to exact computation. 

We can find nearly the top-1000 for Amazon Kindle (sparse 
A T B, see Fig. 6) using 10 7 samples, but the time is coming 
somewhat close to exact computation. This is a relatively small 
problem, and we expect that larger problems will have a more 































































significant benefit. The performance on the recommendation 
application in Fig. 7 yields good results. 

We have 10X speed-up for the top-1000 entries for Million 
Song (sparse A T A, see Fig. 6). We find the top entry after 
1000 samples, which is a bit higher than the estimate in Table I 
of « 100 samples. 

C. Applications of MAD 

Scenario 1 (Free Samples). A review site wants to en¬ 
courage more reviews, so its goal is to select a limited number 
of reviewer-product pairs with the idea that the selected 
reviewer will be given a free item to review. We let A be a 
reviewer-by-product matrix and B is a product “also-bought” 
matrix. We then pick reviewer-product pairs, with the caveat 
that we should not give a reviewer something that they have 
already reviewed. 

We use Amazon purchase data on reviews and product- 
product relationships [37], [38], We focus on the Amazon 
Kindle subset, in which case we recommend e-books to be 
given to particular reviewers to solicit their reviews. Given 
the top recommendations for a free sample, we must check 
that the reviewer has not already reviewed the product; for 
this data set, about half of the top-f pairs correspond to 
new products for the particular reviewer. Fig. 7 shows an 
example top entry: it recommends a romance book to a user 
that has already reviewed many similar romance novels (three 
examples selected at random are shown). 



Fig. 7: Example top pair from Amazon Kindle dataset: user ID 
1317513 and the book entitled “Stay.” The user has reviewed 649 
other books, three random examples of which are shown. 


Scenario 2 (2For1). A retailer wants to select, say, 100 
pairs of products for a 2-for-the-price-of-l promotion. If we 
assume each product has a dense or sparse representation in 
some feature space, this becomes a MAD search with B = A. 
In the case of the Million Song dataset, we let A denote a 
song-by-user matrix where entry (i. j) denotes the number of 
plays of song i by user j. We want to find pairs of distinct 
songs that have the highest number of common plays. 

For this, we find the top-f pairs of similar songs in the 
Million Song dataset, based on being played by the same users. 
We calculate the top-f all-pairs dot products for columns of 
A as explained in Section III-B2. Additionally, we ignore the 
diagonal entries that pair songs with themselves. The top song 
pair using this metric is “Undo” by Bjork and “Revelry” by 
Kings of Leon, which are both in the alternative genre. 


Scenario 3 (Link Prediction). We want to find members 
of a social network that should be connected but are not. These 
can be used as recommendations for new “friends.” 

We use the Live Journal data. The top entry in A 2 is not an 
edge in A; 6 out of top 10 are not; 55 out of top 100 are not; 
and 511 out of top 1000 are not. The top-10 new-connection 
suggestions have over 1800 common neighbors per pair. 

D. Comparison to wedge sampling by Cohen-Lewis 

The most similar related work is that of Cohen and Lewis 
[28], [8], which we refer to as wedge sampling. In this 
section, we compare diamond sampling to wedge sampling and 
present results for the Skitter dataset, which showed the most 
distinction between the methods. We generalized Cohen and 
Lewis’ approach to f-MAD and implemented wedge sampling 
with similar optimizations to those described in Section V-B. 
In general, using diamonds requires fewer (three-path) samples 
than wedge samples to identify top entries in the output matrix. 
In our implementation, the preprocessing cost for diamonds 
is greater than for wedges, but the per-sample costs of each 
method are roughly the same. 

In Fig. 8, we present top-f scores and times for wedge and 
diamond sampling on the Skitter dataset. In these experiments, 
we set the budget of dot products to be f' = 1000 • f. For 
fewer than 10 5 samples, the diamond sampling has much better 
accuracy, but because the time is dominated by preprocessing, 
diamond sampling is also more expensive. For greater than 10 5 
samples, the time is dominated by sampling and computing 
dot products, so the running times of diamond and wedge 
sampling approach are roughly the same. However, diamond 
sampling has identified all top entries by 10 5 samples, while 
wedge sampling needs 10 6 or 10 7 samples to identify all top 
entries, requiring an order of magnitude more time. 




Samples (s) 


Fig. 8: Comparison of diamond sampling and wedge sampling. 


E. Comparison to asymmetric LSH 

We provide a comparison to some experiments in the paper 
by Shrivastava and Li [9] (additional details in [39]) for the 
Movielens-IOM data set [31]. User j corresponds to column 
bj. For each user j, we want to find the top-10 movie 
recommendations. In other words, for each user, we want to 
solve the fc-MIPS problem. 














Precision-recall results over 2000 random users using asym¬ 
metric LSH and an increasing number of hash functions, 
h G { 64,128, 256, 512 }, are reported in [9], [39]. The amount 
of storage increases with h. For comparison, we reproduce the 
curves reports in their figures, although we did not redo the 
experiments. 

We also pick 2000 random users and apply our diamond 
sampling approach in Algorithm 3 repeatedly, using just a 
single column of B in each application. The A matrix is 
approximately 79MB in size. We need to keep one object that 
is the size of A and a few vectors of length m or s, for a total 
of less that 100MB extra storage. While it is hard to pin down 
the exact storage of LSH methods, it is on the order of one to 
two magnitudes more than the dataset. It is well-known that 
LSH is memory intensive (this is explicitly called out in the 
E2-LSH manual [40]). 

We use an increasing number of samples, s £ 
{64,128,256,512}. We set the dot-product budget to be 
t! = s. Average precision-recall curves are shown in Fig. 9. 
It is difficult to compare the methods directly, so we cannot 
say that using 64 samples is comparable to using 64 hash 
functions. However, we can say that the storage per sample is 
much less than the storage per hash. 
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Fig. 9: Comparison of diamond sampling and asymmetric LSH. 
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